In September 2020, Twitter released new API endpoints (v2) and modified the payload of the returned Twitter objects. The new payload includes several useful fields that were not previously available (e.g., like counts and impression data).
This repository provides Python modules to query the new Twitter API, instructions for creating a locally hosted MySQL database, modules to translate the data to SQL, and modules to create interactive network graphs.
Storing data in MySQL provides the requisite structure to efficiently perform network analysis of mention, hashtag, or URL networks.
## Data Collection
The analytic workflow starts with collecting data from the v2 Twitter API using functions found in the api_wrapper module. To access the Twitter API, users must first request a bearer token from the Twitter developer portal.
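Requests to the v2 API are authorized by passing the bearer token in the `Authorization` header. A minimal sketch of how such a header is constructed (the token value is a placeholder; no request is sent here):

```python
# Placeholder token: the real value comes from the Twitter developer portal.
BEARER_TOKEN = "YOUR_TOKEN_HERE"

def auth_headers(token):
    """Build the Authorization header the v2 API expects."""
    return {"Authorization": f"Bearer {token}"}

headers = auth_headers(BEARER_TOKEN)
```

The api_wrapper module presumably attaches a header like this to every request it makes.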
In this example, we query a set of Twitter accounts previously identified as posting pro- or anti-vaccination content by calling the get_users function. Seed user names are provided in the seed_accounts.csv file in the data folder of this GitHub repository.
Since the Twitter API rate limit for user lookup queries is 900 accounts per 15 minutes, the DataCollection notebook incorporates sleep time to prevent timeouts. The workflow also incorporates error handling for the empty responses that result when querying deleted or suspended accounts.
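The batching-and-sleeping pattern can be sketched as follows. This is not the repo's actual implementation; the batch size of 100 reflects the v2 user-lookup endpoint's per-request cap, and the pause length is an illustrative knob, not a tuned value:

```python
import time

def batch_user_lookups(usernames, batch_size=100, pause_seconds=1.0):
    """Yield batches of usernames, sleeping between batches to respect rate limits."""
    for start in range(0, len(usernames), batch_size):
        yield usernames[start:start + batch_size]
        if start + batch_size < len(usernames):
            time.sleep(pause_seconds)

# With the 783 seed accounts from this example, batches of 100 mean 8 requests.
seeds = [f"user{i}" for i in range(783)]
batches = list(batch_user_lookups(seeds, pause_seconds=0.0))
```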
Next, we query the seed accounts for their last 200 posts by calling the get_user_activity function. Since useful information (e.g., public metrics) is nested within the user-activity JSON objects, the get_user_activity function extracts and appends these data as additional columns where possible.
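The flattening step amounts to promoting nested metrics to top-level columns. A rough sketch (field names follow the v2 tweet payload; the exact columns the module emits may differ):

```python
def flatten_tweet(tweet):
    """Promote nested public_metrics values to top-level keys."""
    flat = {k: v for k, v in tweet.items() if k != "public_metrics"}
    for metric, value in tweet.get("public_metrics", {}).items():
        flat[metric] = value
    return flat

flat = flatten_tweet({
    "id": "999",
    "text": "hello",
    "public_metrics": {"retweet_count": 3, "like_count": 12},
})
```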
Likewise, we call the extract_el function to parse mentions, hashtags, and URLs into a directed edgelist. Source nodes in this edgelist record the author_id and author_screen_name of users who mention entities (i.e., users, hashtags, or URLs) in a post. Since users can mention several entities in a single post, the activity data has a one-to-many relationship with the edgelist. In this example, 783 seed users mentioned 49,022 users.
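A minimal sketch of the idea behind extract_el (not the repo's actual implementation): entity field names follow the Twitter v2 payload, and author_screen_name is assumed to have already been joined onto the tweet record.

```python
def tweet_to_edges(tweet):
    """Parse one tweet's nested entities into directed (source -> target) edges."""
    src = (tweet["author_id"], tweet["author_screen_name"])
    entities = tweet.get("entities", {})
    edges = []
    for m in entities.get("mentions", []):
        edges.append((*src, "mention", m["username"]))
    for h in entities.get("hashtags", []):
        edges.append((*src, "hashtag", h["tag"]))
    for u in entities.get("urls", []):
        edges.append((*src, "url", u["expanded_url"]))
    return edges

sample = {
    "author_id": "123",
    "author_screen_name": "alice",
    "entities": {
        "mentions": [{"username": "bob"}, {"username": "carol"}],
        "hashtags": [{"tag": "vaccine"}],
    },
}
edges = tweet_to_edges(sample)  # one tweet yields three edges: one-to-many
```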
Although we did not perform any additional data collection beyond the steps above, we could easily incorporate additional steps if needed. For example, we could collect user profile information for the 49,022 mentioned users, incorporate information from additional sources (e.g., query the Botometer API for bot-likelihood scores), or expand our analysis to a 2-hop network by querying the mentioned users for their account activity.
To make the api_wrapper module more effective, we need to improve error handling, identify optimal sleep times for dealing with rate limits, transition the API functions to a class object, improve documentation of the available Twitter object fields, and write API responses to JSON files rather than CSV files.
## Creating a Local MySQL Database

In retrospect, we would have preferred to create a cloud-hosted MySQL database using Amazon's Relational Database Service (RDS). Instead, we created a Docker container to host a MySQL database locally.
To do this, we created an empty directory and added the following Docker .yml file to it:
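The authors' exact compose file is not reproduced in this excerpt; a minimal sketch of a MySQL compose file might look like the following (the image tag, service name, credentials, and paths are illustrative, not the repo's actual values):

```yaml
version: "3.8"
services:
  mysql:
    image: mysql:8.0
    container_name: twitter_mysql
    environment:
      MYSQL_ROOT_PASSWORD: change_me   # placeholder credential
      MYSQL_DATABASE: twitter_data     # illustrative database name
    ports:
      - "3306:3306"                    # expose MySQL to the host
    volumes:
      - ./mysql_data:/var/lib/mysql    # persist data across restarts
```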
Then, we navigated to the directory containing the .yml file and executed `docker compose up`.
## Translating Data to MySQL
The DataETL_ToMySQL notebook provides an overview of the steps we used to create SQL table schemas and write the .csv data files to our database. The entity relationship diagram of our database is shown below:
Below is an example of the ETL workflow for the user table. First, we create a SQL table, specifying the data types for each of the fields.
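A hypothetical, simplified version of such a schema is shown below. The column names are illustrative (the notebook defines the actual MySQL types), and sqlite3 is used here only so the DDL can be exercised without a running MySQL server:

```python
import sqlite3

# Simplified, illustrative user-table schema; not the repo's exact DDL.
CREATE_USERS = """
CREATE TABLE users (
    author_id        VARCHAR(32) PRIMARY KEY,
    username         VARCHAR(64),
    created_at       DATETIME,
    followers_count  INTEGER,
    following_count  INTEGER,
    tweet_count      INTEGER
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(CREATE_USERS)
# Inspect the resulting table: PRAGMA table_info returns one row per column.
columns = [row[1] for row in conn.execute("PRAGMA table_info(users)")]
```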
Next, we load the data we collected in the previous step and transform fields where appropriate (e.g., converting dates to date objects). In this example, we also had to replace apostrophes found within text fields with spaces, because including them caused errors in the write step.
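A sketch of that transform step, assuming v2-style ISO-8601 timestamps and a free-text description field (the field names are illustrative):

```python
from datetime import datetime

def transform_row(row):
    """Parse the timestamp and strip apostrophes from free-text fields."""
    row = dict(row)
    # Twitter v2 timestamps look like "2020-09-15T12:34:56.000Z".
    row["created_at"] = datetime.strptime(
        row["created_at"], "%Y-%m-%dT%H:%M:%S.%fZ")
    # Replace apostrophes with spaces, as described above.
    row["description"] = row["description"].replace("'", " ")
    return row

clean = transform_row({
    "created_at": "2020-09-15T12:34:56.000Z",
    "description": "I'm a researcher",
})
```

An alternative worth noting: parameterized queries (placeholders bound by the driver) escape quotes automatically and would sidestep the apostrophe problem entirely.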
Last, we write the data frame, row by row, to the SQL database.
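The row-by-row write looks roughly like this. It is shown against an in-memory sqlite3 database so it can run standalone; the notebook writes to MySQL via pymysql, where the placeholder syntax is `%s` rather than `?`:

```python
import sqlite3

rows = [
    ("123", "alice", 1500),
    ("456", "bob", 200),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (author_id TEXT, username TEXT, followers INTEGER)")
# Insert one row per execute call, mirroring the row-by-row write step.
for row in rows:
    conn.execute("INSERT INTO users VALUES (?, ?, ?)", row)
conn.commit()
count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```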
## Data Analysis and Visualization

Now that the data is stored in a SQL database, we can query it using the pymysql package and the network_vis_helper module. A simple example is provided in the DataAnalysisVisualization notebook to demonstrate how users can interact with the database and to highlight the benefit of writing more efficient queries (i.e., avoiding a SELECT * approach).
For example, if we want to view the top 10 most-liked tweets, it takes over 5 seconds to load the entire statuses table into memory and use Python to filter the data to the top 10 records.
In contrast, if we answer this question using only SQL commands, the query executes in under a second (~0.12 seconds).
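The two patterns can be contrasted with a toy example. This uses an in-memory sqlite3 table with synthetic like counts (our actual notebook targets MySQL), but the principle is the same: push the sorting and limiting into SQL instead of doing it in Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE statuses (tweet_id TEXT, like_count INTEGER)")
conn.executemany(
    "INSERT INTO statuses VALUES (?, ?)",
    [(str(i), i * 7 % 1000) for i in range(5000)],  # synthetic like counts
)

# Slow pattern: SELECT * pulls every row into Python, then sorts and slices.
all_rows = conn.execute("SELECT * FROM statuses").fetchall()
top10_python = sorted(all_rows, key=lambda r: r[1], reverse=True)[:10]

# Fast pattern: the database does the work and returns only 10 rows.
top10_sql = conn.execute(
    "SELECT tweet_id, like_count FROM statuses "
    "ORDER BY like_count DESC LIMIT 10").fetchall()
```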
Another advantage is that we can leverage the structure of the edgelist table to generate network visualizations much more easily. For example, we can quickly generate a graph of the most popular hashtags as follows:
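The exact network_vis_helper call is repo-specific; as a rough stand-in, this sketch queries hashtag edges from an edgelist table (illustrative schema and toy data) and ranks targets by in-degree, which is the quantity a "most popular hashtags" graph is built around:

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE edgelist (source TEXT, edge_type TEXT, target TEXT)")
conn.executemany(
    "INSERT INTO edgelist VALUES (?, ?, ?)",
    [("alice", "hashtag", "vaccine"),
     ("bob", "hashtag", "vaccine"),
     ("carol", "hashtag", "covid19"),
     ("alice", "mention", "bob")],   # non-hashtag edges are filtered out below
)

# Let SQL restrict the rows to hashtag edges before any Python-side work.
rows = conn.execute(
    "SELECT source, target FROM edgelist WHERE edge_type = 'hashtag'"
).fetchall()
popularity = Counter(target for _, target in rows)  # in-degree per hashtag
top_hashtags = popularity.most_common(2)
```

The resulting (source, target) pairs and in-degree counts are what a graph library then renders as nodes, edges, and node sizes.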
A static image of the graph is provided below; however, the HTML version of this graph allows a user to explore it interactively (e.g., zoom, reposition nodes). An interactive version of the graph can be viewed at the following link.